Skip to content

Conversation

@michalpasztamobica
Copy link
Contributor

@michalpasztamobica michalpasztamobica commented Dec 20, 2019

Summary of changes

Fixes #11544.

EDIT 30.12.2019: Changed the description quite heavily, therefore did not leave the old part stroke through, see history if you want to see the original description.

EDIT 9.01.2020: Now I updated the description below.

The original issue report points out that in case the ESP8266 fails to respond within serial timeout period (2 seconds), there is no way of telling if the packet was sent or not. Therefore if the application decides to retry sending it might be that packet gets sent out twice.

Following the remarks from @ccli8 and from @kjbracey-arm I propose the following:

  1. ESP8266 driver checks if the undocumented Recvd x bytes message arrives. This basically acknowledges successful UART transfer. If it doesn't come or if any other serial-related error pops up, we can assume the ESP is an unknown state and better be reset - we return DEVICE_ERROR if only the SEND OK is missing, we return WOULD_BLOCK and can handle the acknowledgement asynchronously via an OOB.

2) If the serial acknowledge arrives, we can wait a while for SEND OK message (although, according to @kjbracey-arm 's explanation, we don't really have to). There must be some timeout to this wait. I added a counter that allows 3 blocking checks for SEND OK in case of busy s... message.
@AnttiKauppila , I moved your recently added retry mechanism waiting for SEND OK. Now that we check for serial acknowledgement it makes sense to do this wait a bit later. I also added a limit to it (3 retries), please bear this in mind during review.

  1. @ccli8 has proposed more fixes, so this PR also fixes the reopening of the socket (with the possible SEND FAIL in mind, introduces per-socket failure flag and removed the large chunk of code which was holding send() back waiting for busy or SEND OK/FAIL - @AnttiKauppila , please pay attention to this part when reviewing as this removes your recent changes.

I checked that all greentea tests which are passing on master are also passing with this PR (CI is not running ESP8266) and the mbed-os-example-pelion connects fine. I noticed however that netsocket-udp test suite takes much longer to pass in RAAS (more than 1100 seconds), so the timeout was increased accordingly.

Impact of changes

The send() calls might take longer, but success rate will increase.

Migration actions required

None

Documentation

None


Pull request type

[x] Patch update (Bug fix / Target update / Docs update / Test update / Refactor)
[] Feature update (New feature / Functionality change / New API)
[] Major update (Breaking change E.g. Return code change / API behaviour change)

Test results

[] No Tests required for this change (E.g docs only update)
[x] Covered by existing mbed-os tests (Greentea or Unittest)
[] Tests / results supplied as part of this PR

Reviewers

@ccli8
@AnttiKauppila
@VeijoPesonen
@SeppoTakalo


@ciarmcom
Copy link
Member

@michalpasztamobica, thank you for your changes.
@ccli8 @AnttiKauppila @SeppoTakalo @VeijoPesonen @ARMmbed/mbed-os-ipcore @ARMmbed/mbed-os-maintainers please review.

@ccli8
Copy link
Contributor

ccli8 commented Dec 23, 2019

NSAPI_ERROR_WOULD_BLOCK means the upper layer can re-try the failed data send, so return code with it must be deliberated upon to avoid duplicate data sends:

Data not written to ESP8266 yet and ESP8266 replying busy, this is the most appropriate condition to return NSAPI_ERROR_WOULD_BLOCK as error code:

if (!_parser.recv(">")) {
        _parser.remove_oob("OK");
        if (_busy) {
            if (_ok_received) {
                tr_debug("send(): _ok_received.");
                goto RETRY;
            } else if (_parser.recv("OK")) {
                tr_debug("send(): parser found OK");
                goto RETRY;
            }
        }
        tr_debug("send(): Didn't get \">\"");
        ret = NSAPI_ERROR_WOULD_BLOCK;
        goto END;
    }

Data written to ESP8266 but failed, it is OK to return NSAPI_ERROR_DEVICE_ERROR to fail this connection. Not sure if ESP8266 would still buffer this failed data write for its send out, so don't return NSAPI_ERROR_WOULD_BLOCK as error code:

    if (_parser.write((char *)data, (int)amount) < 0) {
        tr_debug("Failed to write data");
        ret = NSAPI_ERROR_DEVICE_ERROR;
    }

    // The "Recv X bytes" is not documented.
    int bytes_confirmed;
    if (!_parser.recv("Recv %d bytes", &bytes_confirmed)) {
        tr_debug("Bytes not confirmed.");
        ret = NSAPI_ERROR_DEVICE_ERROR;
    } else if (bytes_confirmed != amount) {
        tr_debug("Error: confirmed %d bytes, but expected %d.", bytes_confirmed, amount);
        ret = NSAPI_ERROR_DEVICE_ERROR;
    }

Converting error code to NSAPI_ERROR_WOULD_BLOCK here is dangerous because it can cause duplicate data sends:

// error hierarchy, from low to high
    if (_busy) {
        ret = NSAPI_ERROR_WOULD_BLOCK;
        tr_debug("send(): Modem busy. ");
    }

    if (ret == NSAPI_ERROR_DEVICE_ERROR) {
        ret = NSAPI_ERROR_WOULD_BLOCK;
        tr_debug("send(): Send failed.");
    }

@michalpasztamobica
Copy link
Contributor Author

Thanks for your review, @ccli8 . Indeed, the code makes more sense with the three goto ENDs, thanks for pointing this out.

Regarding the NSAPI_ERROR_WOULD_BLOCK return value, I am not sure if it necessarily indicates that a resend should be performed. According to the documentation and previous solutions it is NSAPI_ERROR_WOULD_BLOCK that should be returned in case ESP returns Busy. I assume application layers also expect just this error code when the operation blocks for any reason. I went through a few other implementations of socket_send, but they usually just map every error code to NSAPI_ERROR_DEVICE_ERROR (ESP32, ism43362). I have noticed however that LWIP::socket_send() can potentially return NSAPI_ERROR_IN_PROGRESS.

This makes some sense to me, but this would be a breaking change, so I would much appreciate more feedback...
@kjbracey-arm, would it be acceptable for ESP8266Interface::socket_send() to return NSAPI_ERROR_IN_PROGRESS in case the module reports Busy s...? This would indicate, that the operation is ongoing and there's no point resending the data. Then a sigio call would let the user know, that it's possible to move on, but currently it will just trigger on any serial activity, so I guess we would have to refine it a little bit...
@teetak01 , would this make any difference for the client applications?

@kjbracey
Copy link
Contributor

@kjbracey-arm, would it be acceptable for ESP8266Interface::socket_send() to return NSAPI_ERROR_IN_PROGRESS in case the module reports Busy s...?

No. Either the send call accepts the data for sending, or it doesn't. The precise error code doesn't affect that.

"In progress" as a return code does exist, but it's not applicable for data transfer calls. That is used for "connect" where a "0" return means "I am connected", so there needs to be a separate return for "I have now started connecting, but I'm not connected yet" in the non-blocking case. ("Would block" isn't used because it HAS accepted the connect request, and it's now ongoing. "Would block" would mean it HADN'T accepted the connect request, and state hadn't changed).

For the data transfer calls, all UDP and TCP's non-error responses means "I've accepted the data for sending". So every non-negative return already has the meaning "in progress". There's no completion to indicate.

Read all the discussion about send return codes here: #12083 This is quite similar.

From that discussion, if you believe the modem hasn't accepted the data, then "would block" makes sense for TCP. It's better to use "no memory" for UDP, and/or do some internal retries.

(You can't return anything other than "would block" for TCP, as any other error indicates connection breakage. It's a reliable transport, so can't return errors except for the pseudo-error "would block", unless the connection breaks.)

If your belief that the modem didn't here you turns out to be wrong for UDP, then you can get a duplicate packet, but that should(TM) be harmless. UDP is allowed to duplicate packets - UDP apps should(TM) be tolerant.

If that belief turns out to be wrong for TCP, then the connection is corrupted. If you ever find out (can you?) that you accidentally got the modem to send something twice, you should treat the connection as reset. Do a "force reset" close (if you can) to the modem - if you can't just close it - and give the application a "connection reset" error.

Now, I don't really follow what's going on here, but if you're treating the "recvd X bytes" message as the "modem has got the data" completion, then, yes, you don't need to necessarily wait for "SEND OK".

There's no requirement for UDP that you hang around to see if the data was transmitted. A packet is allowed to be dropped later down the chain after an initial "OK" response.

For TCP, a problem would at least need to be signalled later. It would be acceptable to return "OK" on a send that later failed as long as that failure triggered an error from the next call. The standard "OK" from TCP only ever means "I put it in the send buffer" in a normal implementation, it's never meant it actually got sent. You only really find out if buffered data was sent if you complete a clean close handshake.

The only issue is that if the "SEND OK" is what completes the AT command, you do presumably need to not attempt to send anything else before you get it?

@michalpasztamobica
Copy link
Contributor Author

I just submitted another commit which adjusts the situation to what @kjbracey-arm suggested: return WOULD_BLOCK for TCP or NO_MEMORY for UDP and throw other errors when connection should be considered broken.

I also found that we were silently truncating buffers larger than 2048 B, claiming we have sent the whole packet, so I fixed this, too. (Our tests did not test this scenario, not even sure it's a real one). At the same time ESP8266 is now able to handle a partial send (Recv X bytes, where X is smaller that we we actually tried to send), if that scenario ever occurs.

@ccli8 , ESP8266's serial mechanism and documentation do not allow for a 100% reliable transportation. I think the current code is the most reasonable situation. I extended the timeout that waits for SEND OK, so the cloud-example scenario passes. I hope this is OK for you?

@ccli8
Copy link
Contributor

ccli8 commented Dec 30, 2019

I extended the timeout that waits for SEND OK, so the cloud-example scenario passes.

@michalpasztamobica This doesn't avoid duplicate tcp data send. My concern is that NSAPI_ERROR_WOULD_BLOCK shouldn't return when tcp data has written to ESP8266 modem (buffered or sent out). This would misguide upper layer to re-send the tcp data, which actually has buffered in ESP8266 modem (not receiving SEND OK) or sent out (receiving SEND OK).

According to your above comment, SEND OK must follow data write for least change. For that, when tcp data has written but SEND OK times out, see the connection as broken and return error code other than NSAPI_ERROR_WOULD_BLOCK to indicate connection broken. Of course, this has side effect: breaks the connection prematurely when network status is not so good. But if NSAPI_ERROR_WOULD_BLOCK is returned, the connection will get data corrupted and upper layer needs more time to detect it out.

Code snippet below with NSAPI_ERROR_WOULD_BLOCK return code should be deliberated upon:

// error hierarchy, from low to high
    if (_busy) {
        ret = NSAPI_ERROR_WOULD_BLOCK;
        tr_debug("send(): Modem busy. ");
    }

    if (ret == NSAPI_ERROR_DEVICE_ERROR) {
        ret = NSAPI_ERROR_WOULD_BLOCK;
        tr_debug("send(): Send failed.");
    }

@michalpasztamobica
Copy link
Contributor Author

Update code and description to avoid duplicate packets (if the SEND OK does not arrive, any subsequent call to send will immediatelly return WOULD_BLOCK).
@VeijoPesonen , thanks for the review. I also included your remarks.

// The "Recv X bytes" is not documented.
if (!_parser.recv("Recv %d bytes", &bytes_confirmed)) {
tr_debug("send(): Bytes not confirmed.");
ret = NSAPI_ERROR_DEVICE_ERROR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be removed because TCP and UDP are the only protocols supported.


// ESP8266 ACKed data over serial, but did not ACK over TCP or report any error.
_prev_send_ok_pending = true;
_parser.oob("SEND OK", callback(this, &ESP8266::_oob_send_ok_received));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see OOB handler added here but I don't see one removed anywhere in the code.

We are now checking if ESP8266 has confirmed receiving data over serial
port with an undocumented (but existing) "Recv x bytes" message. Next we
are explicitly waiting for an official "SEND OK".
@michalpasztamobica michalpasztamobica changed the title ESP8266: add a retry mechanism to avoid duplicate data sends ESP8266: Avoid duplicate data sends Dec 31, 2019
@michalpasztamobica
Copy link
Contributor Author

@VeijoPesonen , @ccli8 thanks for review, I hope I addressed both your remarks.

Regarding the oob adding/removal. SEND OK and SEND FAIL only make sense when we try to send something. More precisely - after we get a confirmation from ESP8266 that the data has been received over serial. Therefore I would stick to registering them in send(). Those two messages are mutually exclusive, so we expect that only one of them arrives and that it arrives only once per send(). Hence - it is best to remove both oobs when any of the two messages arrives.

Regarding the status instead of flags - I agree with you, @ccli8 and I added a variable to handle this (although a bit different than you suggested).

There is one more thing I noticed. With waiting for serial ACK and then for SEND OK, as per documentation the netsocket-udp test takes more than 1100 seconds (it used to take less then 200 seconds). I do see that the echotest_burst and sendto_repeat are particularly slower, but also - that they pass more often. More than that - when I run the very same code on my desk in a quiet environment the same tests runs for just 70 seconds! It seems that ESP8266 has some mechanism of confirming the UDP sends, that we have so far ignored, but now take into consideration (waiting for serial Recvd X bytes and for SEND OK) and it gets way slower. I still think this is the way to go.

@ccli8
Copy link
Contributor

ccli8 commented Jan 2, 2020

@michalpasztamobica With _send_status per device, one socket send error can cause all other socket sends to fail. Can _send_status be bound to the socket of the latest ESP8266::send(...) call? So one socket send error doesn't propagate to all other sockets.

Now that ESP8266::send(...) can get pending, socket close can fail with busy, right? Then meet socket OOM error in next socket open because socket is not recycled in socket close.

bool ESP8266::close(int id)
{
    //May take a second try if device is busy
    for (unsigned i = 0; i < 2; i++) {
        _smutex.lock();
        if (_parser.send("AT+CIPCLOSE=%d", id)) {

@michalpasztamobica
Copy link
Contributor Author

@AnttiKauppila , @VeijoPesonen , @SeppoTakalo we need your judgement with the last open issue in this PR. No need to ready through the whole discussion, don't worry ;-).

The last issue we have left is that currently ESP8266Interface::socket_close will mark the socket closed and return DEVICE_ERROR if the ESP8266 failed to close the socket (for example, because it was busy):

int ESP8266Interface::socket_close(void *handle)
{
 ...
    if (socket->connected && !_esp.close(socket->id)) {
        ret = NSAPI_ERROR_DEVICE_ERROR;
    }
    // mark the socket as closed, even thouth _esp.close() returned false.
    return ret;

I think this is a bug and that we should return DEVICE_ERROR immediately and not set the open flag to false (thus claiming the socket has been closed even though it hasn't). Application should retry to close or reset the device if it is not possible, before opening the socket anew.

@ccli8 thinks that socket_close() is doing the right thing and applications will mostly ignore the close() return value anyway. Therefore he proposed that we call close() on every open_tcp/upd, in case we failed to close it correctly before. @ccli8 , please correct me if I put anything incorrectly here.

Can you please help us decide which is the right way here?

@SeppoTakalo
Copy link
Contributor

Oh my.. ESP never ceases to amaze me...

When Socket::close() fails... does it mean that you could continue to use the socket? If you allow that, you just open a new can of worms..

Socket has only one error code, that allows you to continue using it, the WOULD_BLOCK, which is a bit special. But for all others, the right procedure to react is to close the socket and delete it.

I would even claim that DEVICE_ERROR is a type of error where right procedure would be to shut down the whole network interface and restart it. It should not be returned from device, if the conditions allow it to proceed. But, unfortunately, it has been used in cases where there is no other generic error codes available.

But for the close.. I would still allow user to call Socket::close() and whether it fails or not, I would not expect anyone to retry that operation. Just delete the object.
Then in a failing condition, it leaves one open socket on ESP side, and in the future application is going to call Socket::open() in which case the driver sees that this socket ID is already taken. That condition should lead the driver to try cleaning up, once again. If success, then a new ID can be returned. If it fails, it can return a failure to application, which will retry on later time, or reset the whole device.

Another reason not to expect anyone retrying Socket:close() is that this function is often called in destructor. It is perfectly legal to create a new socket, open and use it. But instead of closing, just delete it. It triggers all the same shutdown mechanisms. There are many cases, where you know that both ends of the connection know that all data is sent&received and you can happily destroy it, or maybe the server side already initiated the shutdown.

@michalpasztamobica
Copy link
Contributor Author

Thanks a lot, @SeppoTakalo
@ccli8 , it turns out you were right. The current version of the commit can go in as is. I can cherry-pick it to my branch if that's OK with you?
Just minor request - can you please remove the send_status enum from ESP8266.h? We don't need it any more.

@ccli8
Copy link
Contributor

ccli8 commented Jan 9, 2020

@michalpasztamobica I've removed send_status enum. Feel free to cherry-pick my branch.

1.  Fix 'spurious close' by adding close() in open(). 'spurious close' gets frequent and cannot ignore when send() changes to asynchronous. User can retry open() until 'spurious close' gets true.
2.  Allow only one actively sending socket because:
    (1) ESP8266 AT packets 'SEND OK'/'SEND FAIL' are not associated with socket ID. No way to tell them.
    (2) In original implementation, ESP8266::send() is synchronous, which implies only one actively sending socket.
3.  Register 'SEND OK'/'SEND FAIL' oobs, like others in ESP8266::ESP8266 constructor. Don't get involved in oob management with send status because ESP8266 modem possibly doesn't reply these packets on error case.
4.  Now that ESP8266::send() changes to asynchronous, drop the code with _parser.recv("SEND OK")/_parser.recv("SEND FAIL"). _parser.recv("SEND OK")/_parser.recv("SEND FAIL") and 'SEND OK'/'SEND FAIL' oobs both consume 'SEND OK'/'SEND FAIL' packets and complicate flow control.
@0xc0170
Copy link
Contributor

0xc0170 commented Jan 9, 2020

Ci started

@mbed-ci
Copy link

mbed-ci commented Jan 9, 2020

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 1
Build artifacts

@michalpasztamobica
Copy link
Contributor Author

Aside from the official CI I run netsocket-* and network-* tests locally and they all passed (some DNS tests failed if I had logs enabled, probably because the oobs logs got quite heavy).

@michalpasztamobica
Copy link
Contributor Author

@AnttiKauppila , @SeppoTakalo , @VeijoPesonen , could at least one of you look through the changes ang give your approval?

if (id == _sock_sending_id) {
_sock_sending_id = -1;
}
_sock_i[id].send_fail = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same snippet has been added at least to 6 different places so maybe it should be turned into a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a function and replaced the reoccurent code.

if (_sock_sending_id >= 0 && _sock_sending_id < SOCKET_COUNT) {
if (!_sock_i[id].send_fail) {
tr_debug("send(): Previous packet (socket %d) was not yet ACK-ed with SEND OK.", _sock_sending_id);
return NSAPI_ERROR_WOULD_BLOCK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if modem doesn't reply with anything? Might SEND OK or SEND FAIL get lost? I'm just thinking out loud here so this isn't a request to change anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was discussed in the main thread. Please see this comment and the previous one.
Long story short: ERROR is another possibility aside from SEND OK and SEND FAIL, but as experiments showed it is a recoverable error, so we basically ignore it.
If neither SEND OK nor SEND FAIL are coming we just keep returning WOULD_BLOCK to any new send() attempts. It's up to the application to decide how long this can be tolerated. I assume some socket timeout will take care of this in a typical mbed application?

goto END;
} else if (bytes_confirmed != amount) {
tr_debug("send(): Error: confirmed %d bytes, but expected %d.", bytes_confirmed, amount);
ret = NSAPI_ERROR_DEVICE_ERROR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be ok in case of TCP that ESP8266 accepts less data than we are trying to send?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this as well, but wasn't brave enough to implement it ;-). I have never seen this situation to happen, so I assume it's either all or nothing, but you are right, if ESP ever decides to accept partial write, we should be able to handle it.

@mbed-ci
Copy link

mbed-ci commented Jan 10, 2020

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 2
Build artifacts

@0xc0170 0xc0170 merged commit f4842c7 into ARMmbed:master Jan 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-version: 6.0.0-alpha-2 Second pre-release version of 6.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESP8266: Pelion connect fails with 'send tcp data' timeout

9 participants